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Abstract Optimization approaches based on operator splitting are becoming popu¬ 
lar for solving sparsity regularized statistical machine learning models. While many 
have proposed fast algorithms to solve these problems for a single regularization 
parameter, conspicuously less attention has been given to computing regularization 
paths, or solving the optimization problems over the full range of regularization pa¬ 
rameters to obtain a sequence of sparse models. In this chapter, we aim to quickly 
approximate the sequence of sparse models associated with regularization paths for 
the purposes of statistical model selection by using the building blocks from a clas¬ 
sical operator splitting method, the Alternating Direction Method of Multipliers 
(ADMM). We begin by proposing an ADMM algorithm that uses warm-starts to 
quickly compute the regularization path. Then, by employing approximations along 
this warm-starting ADMM algorithm, we propose a novel concept that we term the 
ADMM Algorithmic Regularization Path. Our method can quickly outline the se¬ 
quence of sparse models associated with the regularization path in computational 
time that is often less than that of using the ADMM algorithm to solve the problem 
at a single regularization parameter. We demonstrate the applicability and substan¬ 
tial computational savings of our approach through three popular examples, sparse 
linear regression, reduced-rank multi-task learning, and convex clustering. 
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1 Introduction 


With the rise of Big Data and the subsequent explosion of statistical machine learn¬ 
ing methods to analyze it, statisticians have become avid consumers of large-scale 
optimization procedures to estimate sparse models. The estimation problem is often 
cast as an optimization problem of the form: 


minimize 

P 


L(/3;W)-flP(/3), 


( 1 ) 


where /3 is a parameter which specifies a statistical model, L(j3;W) is a smooth 
loss function or data-fidelity term that quantihes the discrepancy between the data, 
W, and the model specihed by P, and P{P) is a nonsmooth penalty that encour¬ 
ages sparsity in model parameter P HIllBl. A regularization parameter, A > 0, 
explicitly trades off the model fit and the model complexity. 

Directly solving the optimization problem ([T]) is often challenging. Operator 
splitting methods, such as the Alternating Direction Method of Multipliers (ADMM), 
have become popular because they convert solving the problem into solving a se¬ 
quence of simpler optimization problems that involve only the smooth loss or nons¬ 
mooth penalty. By breaking up the problem into smaller ones, ADMM may end up 
taking more iterations than directly solving Q, but it often runs in less total time 
since the subproblems are typically easy to solve. Clearly in the context of Big Data, 
faster algorithms are indispensable, and the numerical optimization community has 
devoted a great deal of effort to solving Q rapidly for a hxed value of X. This 
goal, however, is not necessarily aligned with the application of statistical machine 
learning problems to real data. 

In practice, statisticians are interested in finding the best sparse model that repre¬ 
sents the data. Achieving this typically entails a two-step procedure: (i) model selec¬ 
tion, or selecting the best sparse model or equivalently the best subset of parameters, 
and (ii) model fitting, or htting the model by minimizing the loss function over the 
selected parameters ca. The hrst step is often the most challenging computation¬ 
ally as this entails searching the combinatorial space of all possible sparse models. 
As this combinatorial search is infeasible for large-scale problems, many consider 
convex relaxations through constraints or penalties as computationally feasible sur¬ 
rogates to help search through the space of sparse models. Consider for example, 
sparse linear regression, where the goal is to hnd the subset of variables or inputs 
that best predicts the response or output. Searching over all possible subsets of vari¬ 
ables, however, is an NP hard problem. Instead, many have employed the penalty 
or constraint, P(j3) = ||j3|ji, which is the tightest convex relaxation to performing 
best subset selection and whose solution can be computed in polynomial time. The 
nonsmooth penalty term, P{P), then serves to translate an infeasible computational 
problem into a tractable one for model selection purposes. 

Suppose now that we focus on selecting the best sparse model by means of penal¬ 
ized statistical estimation as in Q. As X varies, we trace out a continuous parametric 
curve P{X) S K'’. Since this curve cannot be determined analytically in general, the 
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curve is estimated for a finite sequence of regularization parameters. To choose the 
best model, statisticians inspect the sequence of sparse solutions to ([T]) over the full 
range of regularization parameters; {^(A„) ; 0 < Ai < • • • < Amax}, where Amax is 
the value of A at which ^(Ai^iax) = 0 , the maximally sparse solution. This sequence 
of sparse solutions is often called the regularization path HSlIIolIIl. For model 
selection purposes, however, the actual parameter values, ^(A), as A varies in the 
regularization paths are less important than identifying the non-zero components 
of /3(A). (Note that the parameter values for the optimal model are typically re-ht 
anyways in the second model htting stage.) Instead, the support of ^(A) or the se¬ 
quence of active sets dehned as ^/(A) = {7 : )3y(A) 7 ^ 0 }, are the important items; 
these yield a good sequence of sparse models to consider that limit computationally 
intensive exploration of a combinatorial model space. Out of this regularization path 
or sequence of active sets, the optimal model can be chosen via a number of popular 
techniques such as minimizing the trade-off in model complexity as with the AIC 
and BIC, the prediction error as with cross-validation m or the model stability as 
with stability selection E^ . 

To apply sparse statistical learning methods to large-scale problems, we need fast 
algorithms not only to ht 0 for one value of A, but to estimate the entire sequence 
of sparse models in the model selection stage. Our objective in this chapter is to 
study the latter, which has received relatively little attention from the optimization 
community. Specihcally, we seek to develop a new method to approximate the se¬ 
quence of active sets associated with regularization paths that is (i) computationally 
fast and (ii) comprehensively explores the space of sparse models at a sufficiently 
hne resolution. In doing so, we will not try to closely approximate the parameter 
values, /3(A), but instead try to closely approximate the sparsity of the parameters, 
. 2 A(A), for the statistical learning problem 0 . 

To rapidly approximate the sequence of active sets associated with regularization 
paths, we turn to the ADMM optimization framework. We hrst introduce a proce¬ 
dure to estimate the regularization path by using the ADMM algorithm with warm 
starts over a range of regularization parameters to yield a path-like sequence of solu¬ 
tions. Extending this, we preform a one-step approximation along each point on this 
path, yielding the novel method that we term ADMM Algorithmic Regularization 
Paths. Our procedure can closely approximate active sets given by regularization 
paths at a hne resolution, but dramatically reduces computational time. This new 
approach to estimating a sequence of sparse models opens many interesting ques¬ 
tions from both statistical and optimization perspectives. In this chapter, however, 
we focus on motivating our approach and demonstrating its computational advan¬ 
tages on several sparse statistical machine learning examples. 

This chapter is organized as follows. We hrst review how ADMM algorithms 
have been used in the statistical machine learning literature. Section o Then, to 
motivate our approach, we consider application of ADMM to the familiar exam¬ 
ple of sparse linear regression. Section |1.2| In Section we introduce our novel 
Algorithmic Regularization Paths for general sparse statistical machine learning 
procedures. We then demonstrate how to apply our methods through some popu- 
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lar machine learning problems in Section ^ specifically, we consider three exam¬ 
ples - sparse linear regression (Section!^ i, reduced-rank multi-task learning (Sec¬ 
tion |3.2| i, and convex clustering (Section |3.3| l - where our Algorithm Paths yield 
substantial computational benehts. We conclude with a discussion of our work and 
the many open questions it raises in Sectionj^ 


1.1 ADMM in Statistical Machine Learning 

The ADMM algorithm has become popular in statistical machine learning in recent 
years because the resulting algorithms are typically simple to code and can scale ef- 
hciently to large problems. Although ADMM has been successfully applied over a 
diverse spectrum of problems, there are essentially two thematic challenges among 
the problems that ADMM has proven adept at addressing: (i) decoupling constraints 
and regularizers, that are straightforward to handle individually, but not in conjunc¬ 
tion; and (ii) simplifying fusion type penalties. We make note of these two types of 
problems because the ADMM Algorithmic Regularization Path we introduce in this 
chapter can be applied to either type of problem. 

An illustrative example of the first thematic challenge arises in sparse principal 
component analysis (PCA). In ll35l Vu et al. propose estimating sparse principal 
subspace estimator B of a symmetric input matrix S with the solution to the follow¬ 
ing semidefinite program: 

minimize — (S,B)-|-A||B||i subject to B G 

B 

where ||B|| i is 1-norm of the vectorization of B, the set = {B: 0^B^I,tr(B) = 

(f} is a closed and convex set called the Fantope, and A > 0 is a regularization pa¬ 
rameter. The main algorithmic challenge is the interaction between the Fantope con¬ 
straint and the £i-norm penalty. If only either the penalty or constraint were present 
the problem would be straightforward to solve. Consider the following equivalent 
problem to which ADMM can be readily applied: 

minimize 5^<;(B) — (S,B)-1-A||Z|| i subject to Z —B = 0, 

B 

where dc{E>) denotes the indicator function of the closed convex set C, namely the 
function that is 0 on C and °° otherwise. By minimizing an augmented Lagrangian 
over B, the copy variable Z, and the scaled dual variable U as outlined in 131, we 
arrive at the following ADMM updates: 
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= argmin ^||B- +p-'S)||^ + 5_^<,(B) = (Z'^-' -' +p-' 

B ^ 

Z'‘ = argmin ^\\Z\\i+^-\\B'‘+ V'‘-^-Z\\l = + V^-') 

= U^-' +B*-Z^ 

Thus, the penalty and constraint are effectively decoupled resulting in simple up¬ 
dates: the update for Z requires the soft-thresholding operator, (x) = sign(x) (|x| — 
p)+, and the update for B involves the projection onto the Fantope, denoted by 
which has a closed form solution given in lf35l . 

The literature abounds with many more examples of using the ADMM splitting 
strategy to decouple an otherwise challenging optimization problem into simpler 
subproblems. Boyd et al. lO review many such applications. Other example appli¬ 
cations include decoupling trace or nuclear norm penalties as in robust PC A ll42ll . 
latent variable graphical models ED, and tensor completion ll20l : decoupling differ¬ 
ent types of hierarchical constraints El. decoupling a series of loss functions ifTSlI . 
decoupling joint graphical models 0 , and decoupling large linear programming 
problems ID, among many others. 

The second thematic challenge that ADMM algorithms have been used to solve 
involve fusion or non-separable penalties. A good illustrative example of this chal¬ 
lenge arises in total variation (TV) denoising ED. Consider the simple version of 
this problem, specifically finding a smooth estimate of a noisy one-dimensional sig¬ 
nal yS'K": 


1 

minimize - ||y- /31|^ -f 1 ^ |/3; - /3,+i |, 

P ^ i=l 

where the tuning parameter A > 0 trades off the smoothness of the approximation 
with the goodness of fit with the data y. What makes this problem challenging is 
that the fusion penalty couples the non-smooth terms so that they are non-separable. 
Note that this penalty can be written more compactly as ||Ax|ji where A is the 
discrete first order differences operator matrix. More generally, this second class 
of problems consist of problems of the form, L(j3;W) + XP{AP). In the machine 
learning context these penalties arise because we often wish to impose structure, not 
on a latent variable of interest directly, but rather on a linear transformation of it. In 
the TV denoising example we seek sparsity in differences of adjacent time points of 
the latent signal. 

Previously, we could break the objective into a sum of simpler objectives. The 
issue here is different; specifically the composition of the regularizer with a linear 
mapping complicates matters. ADMM can again greatly simplify this problem if we 
let the ADMM copy variable copy the linearly transformed parameters: 

minimize L(j3;W)-f Af’(z) subject to z-A^=0. 

P 
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The ADMM subproblems for iteratively solving this problem then have the follow¬ 
ing simple form: 

= argmin L(/3; W)-f J||A/3 +u^-' ||^ 

P 2 

z'= = argmin P(z)-f ^ ||z - Aj3* - ||^ 

= u^-i-P/3*^-z^ 

Note that we have eliminated having to minimize any functions containing the trou¬ 
blesome composition penalty. In the context of the TV denoising example, the 
update requires solving a linear system of equations, and the z update involves a 
straightforward soft-threshold. 

The ADMM algorithm has been used to decouple fusion or non-separable types 
of penalties in many statistical learning problems. These include more general in¬ 
stances of total variation MM, a convex formulation of clustering la , joint graph¬ 
ical model selection ll24ll^ . overlapping group lasso penalties ll40ll . and more gen¬ 
erally for structured sparsity ea. 

Overall, while the ADMM algorithm is gaining more widespread application in 
statistics and machine learning, the algorithm is applied in the traditional sense to 
solve a composite optimization problem for one value of the regularization param¬ 
eter. In this chapter, we seek to investigate the ADMM algorithm for a different 
purpose, namely to find a sequence of sparse models associated with regularization 
paths. 


1.2 Developing an Algorithmic Regularization Path: Sparse 
Regression 

Our goal is to quickly generate a sequence of candidate sparse solutions for model 
selection purposes. To achieve this, we will propose a method of approximating the 
sequence of active sets given by regularization paths, or the path-like sequence solu¬ 
tions of penalized statistical models over the full range of regularization parameters. 
To motivate our approach, we study the familiar example of sparse linear regression. 
Suppose we observe a covariate matrix X G consisting of n iid observations 
of p variables and an associated response variable y G 91". We are interested in fit¬ 
ting the linear model y = Xj3 -fe where e is independent noise, but assume that the 
linear coefficient vector is sparse, || j3||o<C p where H-Ho is the io norm or the 
number of non-zero elements of . Minimizing a criterion subject to a constraint 
of the form ||/3 ||o < k for some k, becomes a combinatorially hard task. To estimate 
a sparse model in reasonable time, many have proposed to use the tightest convex 
relaxation, the ^i-norm penalty, commonly called the LASSO lf3^ in the statistical 
literature: 
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minimize ;^||y-Xi3||^ + A||j3||i (2) 

p In 

where A > 0 is the regularization parameter controlling the sparsity of . 

The full regularization path of solutions for the LASSO is the set of regression co¬ 
efficients {$ (A): V 0 < A < Amax} where Amax = ^ ||X^y|joo is the smallest amount 
of regularization that yields the sparse solution /3 = 0. The regularization paths for 
the LASSO have been well-studied and in particular, are continuous and piece-wise 
linear ll^ l8l [30l . These paths also outline a sequence of active sets or sparse models 
that smoothly increase in sparsity levels as A decreases from the fully sparse solu¬ 
tion at A = Amax to the fully dense solution at A = 0. Hence for model selection, 
one can limit exploration of the combinatorial space of sparse models to that of the 
sequence of active sets outlined by the LASSO regularization paths. 

Computing the full regularization paths, however, can be a computational chal¬ 
lenge. Several path following algorithms for the LASSO ll^ IMl and closely re¬ 
lated algorithms such as such as Least Angle Regression (LAR) and Orthogo¬ 
nal Matching Pursuit (OMP) Q have been proposed; their computational complex¬ 
ity, however, is which is prohibitive for large-scale problems. Consequently, 

many have suggested to closely approximate these paths by solving a series of opti¬ 
mization problems over a grid of regularization parameter values. Specifically, this 
is typically done for a sequence of 100 log-spaced values from Amax to Ai = 0. 
Statisticians often employ homotopy, or warm-starts, to speed computation along 
the regularization path 0. Warm-starts use the solution from the previous value of 
as the initialization for the optimization algorithm to solve the problem 
at Xj+\. As the coefficients, j3, change continuously in A, warm-starts can dramati¬ 
cally reduce the number of iterations needed for convergence as P{Xj) is expected 
to be close to /3(A/+i) for small changes from Aj to Aj+i. Many consider shooting 
methods, or coordinate descent procedures 0|32l, that use warm-starts and iterate 
over the active set for 100 log-spaced values of A ifTOl to be the fastest approximate 
solvers of the LASSO regularization path. 

We seek to further speed the computation of the sequence of active sets given 
by the regularization path by using a single path approximating algorithm instead 
of solving separate optimization problems over a grid of regularization parameter 
values. Our approach is motivated by two separate observations: (i) the evolution 
of the sparsity level of the iterates of the ADMM algorithm used to fit (|^ for one 
value of A, and (ii) the behavior of a new version of the ADMM algorithm that 
incorporates warm-starts to expedite computation of regularization paths. We study 
each of these motivations separately, beginning with the first. 

Consider using ADMM to solve the LASSO problem. First, we split the differen¬ 
tiable loss function from the non-differentiable penalty term by introducing a copy 
z of the variable in the penalty function, and adding an equality constraint forcing 
them to be equal. The LASSO problem (|^ can then be re-written as: 

minimize —lly —Xj3||2 + A||z|ji subjectto i3=z, 
pz 2n 
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with its associated augmented Lagrangian; 

= ^lly-Xi3||^ + A||z||i + |||i3-z + u||i 

Here, u is the scaled dual variable of the same dimension as and p is the algorithm 
tuning parameter. The ADMM algorithm then follows three steps (subproblems) to 
solve the LASSO: 

j3-subproblem: = argmin l||y-X/3||^ + £||/3 -z^-' 

^ Zfi L 

z-subproblem: z^ = argmin A||z||i -f ^||z ——u ^^'||2 

Dual update: -1-/3*^ - z*. 

The benefit of solving this reformulation is simpler iterative updates. These three 
steps are iterated until convergence, typically measured by the primal and dual resid¬ 
uals 0. The j3 -subproblem solves a linear regression with an additional quadratic 
ridge penalty. Solving the z-subproblem introduces sparsity. Notice that this is the 
proximal operator of the ^i-norm applied to which is solved analytically via 

soft-thresholding.Finally, the dual update ensures that is squeezed towards z and 
primal feasibility as the algorithm progresses. 

Consider the sparsity of the z iterates, ||z*||o, for the LASSO problem. Notice 
that as the algorithm proceeds, z^ becomes increasingly sparse; this is illustrated 
for a small simulated example in the left panel of Figure Let us study why this 
occurs and its implications. Regardless of A, the ADMM algorithm begins with a 
fully dense as this is the solution to a ridge problem with parameter p. Soft- 
thresholding in the z-subproblem then sets coefficients of small magnitude to zero. 
The first dual update, has magnitude at most |A|, meaning that the second 
update is essentially shrunken towards P '. Smaller coefficients decrease further in 
magnitude and soft-thresholding in the z-subproblem sets even more coefficients to 
zero. The algorithm thus proceeds until the sparsity of the z^ stabilizes to that of 
the solution, ^(A). Hence, the support of the z* has approximated the active set of 
the solution long before the iterates of the j3-subproblem; the latter typically does 
not reach the sparsity of the solution until convergence when primal feasibility is 
achieved. While Figure only illustrates that z^ quickly converges to the correct 
sparsity level, we have observed empirically in all our examples that the active set 
outlined by ||z^|jo also quickly identifies the true non-zero elements of the solution, 

3(A). 

Interestingly then, the z^ quickly explore a sequence of sparse models going from 
dense to sparse, similar in nature to the sequence of sparse models outlined by the 
regularization path. While from Figure we can see that this sequence of sparse 
models is not desirable as it does not smoothly transition in sparsity and does not 
fully explore the sparse model space, we nonetheless learn two important items from 
this: (i) We are motivated to 
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consider using the algorithm iterates of the z-subproblem, as a possible means 
of quickly exploring the sparse model space; and (ii) we are motivated to consider 
a sequence of solutions going from dense to sparse as this naturally aligns with the 
sparsity levels observed in the ADMM algorithm iterates. Given these, we ask: Is it 
possible to use or modify the iterates of the ADMM algorithm to achieve a path-like 
smooth transition in sparsity levels similar in nature to the sparsity levels and active 
sets corresponding to regularization paths? 





Iteration Iteration 


Fig. 1 Sparsity levels outlining a sequence of active sets for a simulated sparse linear regression 
example. (Left) Sparsity levels of the z-subproblem iterates of the ADMM algorithm, ||z*^||o! At for 
one fixed value of A. (Middle) Sparsity levels of the z-subproblem over the iterates of our path 
approximating ADMM Algorithm with Warm Starts for a small range of A. Vertical lines denote 
the start of the algorithm for an increased value of A. (Right) Sparsity levels of the z-subproblem 
over the iterates of our novel ADMM Algorithmic Regularization Path. 


One possible solution would be to employ warm-starts in the ADMM algorithm 
along a grid of regularization parameters similar to other popular algorithms for ap¬ 
proximating regularization paths. Recall that warm-starts use the solution obtained 
at the previous value of A as initialization for the next value of A. We first introduce 
this new extension of the ADMM algorithm for approximating regularization paths 
in Algorithmic and then return to our motivation of studying the sequence of active 
sets outlined by this algorithm. 

Our ADMM algorithm with warm starts is an alternative algorithm for fitting 
regularization paths. It begins with A small corresponding to a dense model, fits 
the ADMM algorithm to obtain the solution, and then uses the previous solution, 
j3(Ay_i), and dual variable, u(Ay_i), to initialize the ADMM algorithm for Ay. 

Before considering the sequence of active sets outlined by this algorithm, we 
pause to discuss some noteworthy features. First, notice that the ADMM tuning 
parameter, p, does not appear in this algorithm. We have omitted this as a param¬ 
eter by fixing p — I throughout the algorithm. Fixing p stands in contrast with the 
burgeoning literature on how to dynamically update p for ADMM algorithms lO . 
For example, adaptive procedures that change p to speed up convergence are intro¬ 
duced in ifT^ . Others have proposed accelerated versions of the ADMM algorithm 
that achieve a similar phenomenon ini. Changing the algorithm tuning parame¬ 
ter, however, is not conducive to achieving a path-like algorithm using warm-starts. 
Consider the z-subproblem which is solving by soft-thresholding at the level Ay/p. 
Thus, if p is changed in the algorithm, the sparsity levels of z dramatically change. 
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Algorithm 1 ADMM with Warm Starts: Sparse Regression 

1. Initialize = 0, = 0, and M log-spaced values, X = {Ai < A 2 < ... < Xm}, for Ai = 0 

and Xf^ — Xrriax- 

2. Precompute matrix inverse H = (X^X/n-|-I)“* and HX^y. 

3. for j = 1.. .M do 

while ||r*^j| A ||s*|| > e*°* do 

^J = HXTy + H{z‘-u‘-') 

Uy = —Zy 

r* = ^5-z5ands* = z*-z*-' 
it = yt+l 

end while 
end for 

4. Output {P j '- y = 1, • ■ ■ 1 ^} as the regularization path. 


eliminating the advantages of using warm-starts to achieve smooth transitions in 
sparsity levels. Second, notice that we have switched the order of the sub-problems 
by beginning with the z-subproblem. While technically the order of the subproblems 
does not matter lIMl . we begin with the z-subproblem as this is where the sparsity is 
achieved through soft-thresholding at the value. A; hence, the solution for z is what 
changes when X is increased. 

Next, notice that our regularization paths go from dense to sparse, or A small 
to large, which is the opposite of other path-wise algorithms and algorithms that 
approximate regularization paths over a grid of A values Qol. Recall that our ob¬ 
jective is to obtain a smooth path-like transition in sparsity levels corresponding 
to a sequence of active sets that fully explores the space of sparse models. Our 
warm-start procedure naturally aligns with the sparsity levels of the iterates of the 
ADMM algorithm which go from dense to sparse, thus ensuring a smooth transition 
in the sparsity level of z as A is increased. Our warm-start procedure could certainly 
be employed going in the reverse direction from sparse to dense, but we have ob¬ 
served that this introduces discontinuities in the z^ iterates and consequently their 
active sets as well, thus requiring more iterations for convergence. This behavior 
occurs as the solution of the ^-subproblem is always more dense than that of the 
z-subproblem, even when employing warm-starts. 

Now, let us return to our motivation and consider the sparsity levels and cor¬ 
responding sequence of active sets achieved by the iterates of our new path- 
approximating ADMM Algorithm. The sparsity of the iterates of the z-subproblems, 
||z^||o, are plotted for 30 log-spaced values of A for the same simulated example in 
the middle panel of Figure The iterates over all values of A are plotted on the x- 
axis with vertical lines denoting the increase to the next A value. Carefully consider 
the sparsity levels of the z iterates for each fixed value of A in our ADMM algorithm 
with warm starts. Notice that the sparsity levels of z typically stabilize to that of the 
solution within the first few iterations after A is increased. The remaining iterations 
and a large proportion of the computational time are spent on squeezing towards 
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z to satisfy primal feasibility. This means that the z-subproblem has achieved the 
sparsity associated with the active set of ^ (A) within a few iterations of increas¬ 
ing A. One could surmise that if the increase in A were small enough, then the 
z-subproblem could correctly approximate the active set corresponding to A within 
one iteration when using this warm-start procedure. The right panel of Figure[T]illus- 
trates the sparsity levels achieved by the z-subproblem for this sequence of one-step 
approximations to our ADMM algorithm with warm-starts. Notice that this proce¬ 
dure achieves a smooth transition in sparsity levels corresponding to a sequence of 
active sets that fully explore the range of possible sparse models, but requires only 
a fraction of the total number of iterations and compute time. This, then is the moti¬ 
vation for our new ADMM Algorithmic Regularization Paths introduced in the next 
section. 


2 The Algorithmic Regularization Path 

Our objective is to use the ADMM splitting method as the foundation upon which 
to develop a new approximation to the sequence of sparse solutions outlined by reg¬ 
ularization paths. In doing so, we are not interested in estimating parameter values 
by solving a statistical learning optimization problem with high precision. Instead, 
we are interested in quickly exploring the space of sparse model at a fine resolution 
for model selection purposes by approximating the sequence of active sets given by 
the regularization path. 

Again, consider the general sparse statistical machine learning problem of the 
following form; 

minimize L(j3; W)-I-AP(j3), 

P 

where W denotes the “data” (for the sparse linear regression example, W = {X, y}), 
the loss function, L(j3;W) is a differentiable, convex function of j3, and the regular¬ 
ization term, P : 91^ —> 91+ is a convex and non-differentiable penalty function. As 
before, A > 0 is the regularization parameter controlling the trade-off between the 
penalty and loss function. Following the setup of the ADMM algorithm, consider 
splitting the smooth loss from the nonsmooth penalty through the copy variable, z: 

minimize L(j3;W)-I-AP(z) subject to j3 = z, (3) 

With scaled dual variable u, the augmented Lagrangian of general problem Q is 

, z, u) = L(j3; W) + AF(z)-f 111 j3 - z + u|| i 

Now following from the motivations discussed in the previous section, there are 
three key ingredients that we employ in our Algorithmic Regularization Paths: (i) 
warm-starts to go from a dense to a sparse solution, (ii) the sparsity patterns of 
the z-subproblem iterates, and (iii) one-step approximations at each regularization 
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level. We put these ingredients together in Algorithm to give our Algorithmic 
Regularization Paths: 


Algorithm 2 Algorithmic Regularization Path for Sparse Statistical Learning 

1. Initialize z® = 0, u® = 0, 7 ” = e, ^ = 1, and set t > 0. 

2. While II z*^ 11 7^0 

/ = +t (or / = /“‘f) 

P'‘ = minimize L(^;W) + jj|^ +u*“‘ Wl 

P 

z* = minimize 7*P(z) + j ||z —jS* — II 2 (Record z* at each iteration.) 

U*: _ y<:-l 

k = k+l 

end 

3. Output {z* : A: = 1, • • • W} as algorithmic regularization path . 


Our Algorithmic Regularization Path, Algorithm Path for short, outlines a se¬ 
quence of sparse models going from fully dense to fully sparse. This can be used 
as an approximation to the sequence of active sets given by regularization paths for 
the purpose of model selection. Consider that the algorithm begins will the fully 
dense ridge solution. It then gradually increases the amount of regularization, 7 , 
performing one full iterate of the ADMM algorithm (j3-subproblem, z-subproblem, 
and dual update) for each new level of regularization. The regularization level is 
increased until the z-subproblem becomes fully sparse. 

One may ask why we would expect our Algorithm Path to yield a sequence of ac¬ 
tive sets that well approximate those of the regularization path? While a mathemat¬ 
ical proof of this is beyond the scope of this chapter, we outline the intuition stem¬ 
ming from our three key ingredients. (Note that we also demonstrate this through 
specific examples in the next section). 

(i) Warm-starts from dense to sparse. Beginning with a dense solution and grad¬ 
ually increasing the amount of regularization ensures a smooth decrease in the 
sparsity levels coii'esponding to a smooth pruning of the active set as this natu¬ 
rally aligns with sparsity levels of the ADMM algorithm iterates. 

(ii) z-subproblem iterates. The iterates of the z-subproblem encode the sparsity of 
the active set, 0 (A), quickly as compared to the -subproblem which achieves 
sparsity only in the limit upon algorithm convergence. 

(iii) One-step approximations. For a small increase in regularization when using 
warm-starts, the iterates of the z-subproblem often achieve the sparsity level of 
the active set within one-step. 

Notice that if we iterated the three subproblems of our Algorithm Path fully until 
convergence, then our algorithm would be equivalent to our ADMM Algorithm with 
warm starts; thus, the one-step approximation is the major difference between Al¬ 
gorithms [T] and Because of this one-step approximation, we are not fully solving 
0 and thus the parameter values, J3, will never stabilize to that of the regularization 
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path. Instead, our Algorithm Path quickly approximates the sequence of active sets 
corresponding to the regularization path, as encoded in the z-subproblem iterates. 

The astute reader will notice that we have denoted the regularization parameters 
in Algorithm as 7 instead of A as in Q. This was intentional since due to the 
one-step approximation, we are not solving Q and thus the level of regularization 
achieved, 7 , does not correspond to A from Q. Also notice that we have introduced 
a step size, f, that increases the regularization level, 7 , at each iteration. The sequence 
of 7 ’s can either be linearly spaced, as with additive f, or geometrically spaced, as 
with multiplicative t. Again, if t is very small, then we expect the sparsity patterns 
of our Algorithm Paths to well approximate the active sets of regularization paths. 

We will explore the behavior and benefits of our Algorithm Paths through demon¬ 
strations on popular sparse statistical learning problems in the next section. Before 
presenting specihc examples, however, we pause to outline three important advan¬ 
tages that are general to sparse statistical learning problems of the form 0 . 

1. Easy to implement. Our Algorithm Path is much simpler than other algorithms 
to approximate regularization paths. The hardest parts, the j3 and z subprob¬ 
lems, often have analytical solutions for many popular statistical learning meth¬ 
ods. Then, with only one loop, our algorithm can often be implemented in a few 
lines of code. This is in contrast to other algorithm paths which require multiple 
loops and much overhead to track algorithm convergence or the coordinates of 
active sets Qo). 

2. Finer resolution exploration of the model space. Our Algorithm Path has the 
potential to explore the space of sparse models at a much finer resolution than 
other fast methods for approximating regularization paths over a grid of A val¬ 
ues. Consider that as the later are computed over M, typically M = 100, A val¬ 
ues, these can yield an upper bound of M distinct active sets; usually these yield 
much less than M distinct models. In contrast, our Algorithm Path yields an 
upper bound of K distinct models where K is the number of iterations needed, 
depending on the step-size f, to fully explore the sequence of sparse models. 
As K will often be much greater than M, our Algorithm Path will often explore 
a sequence of many more active sets and at a finer resolution than comparable 
methods. 

3. Computationally fast. Our Algorithm Path has the potential to yield a sequence 
of sparse solutions much faster than other methods for computing regulariza¬ 
tion paths. Consider that our algorithm takes at most K iterations. In con¬ 
trast, regularization paths of a grid of M A values require M times the num¬ 
ber of iterations needed to fully estimate ^(Ay); often this will be much larger 
than K. In each iteration of our algorithm, the and z subproblems require 
the most computational time. The subproblem consists of the loss func¬ 
tion with a quadratic penalty which can be solved via an analytical form for 
many loss functions. The z subproblem has the form of the proximal operator 
of P 1291 : prox;Lp(x) = argminu||x —u ||2 -P AP(u). For many popular convex 
penalty-types such as the ^i-norm, group lasso, and nuclear norm, the proximal 
operator has an analytical solution. Thus, for a large number of statistical ma- 
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chine learning problems, the iterations of our Algorithm Path are inexpensive 
to compute. 

Overall, our Algorithmic Regularization Paths give a novel method for hnding 
a sequence of sparse solutions by approximating the active sets of regularization 
paths. Our methods can be used in place of regularization paths for model selection 
purposes with many sparse statistical learning problems. In this chapter, instead 
of studying the mathematical and statistical properties of our new Algorithm Paths, 
which we leave for future work, we study our method through applications to several 
statistical learning problems in the next section. 


3 Examples 

To demonstrate the versatility and advantages of our ADMM Algorithmic Regu¬ 
larization Paths, we present several example applications to sparse statistical learn¬ 
ing methods: sparse linear regression, reduced-rank multi-task learning and convex 
clustering. 


3.1 Sparse Linear Regression 

As our first example, we revisit the motivating example of sparse linear regression 
discussed in Section [L2] We reproduce the problem here for convenience: 

minimize ^||y-Xj3||2 + A||/3||i 

And, our Algorithmic Regularization Path for this example is presented in Algo¬ 
rithm |3] 


Algorithm 3 Algorithmic Regularization Path for Sparse Regression 

1. Initialize z® = 0, u° = 0, y® = e, ^ = 1, and set f > 0. 

2. Precompute matrix inverse H = (X^ X/n-l-I)-' andHX'^y. 

3. While II z*^ 11 7^0 

■f = 'jA-i -\-t 

^*^ = HXTy-yH(z*-'-u*-') 

z* = Sy( (/3* -I- *) (Record at each iteration.) 

U* _ u*:-l 

k = k+i 

end 

4. Output {z^ ■. k = \ ,K} as the algorithmic regularization path . 
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Let us first discuss computational aspects of our Algorithm Path for sparse linear 
regression. Notice that the -subproblem consists of solving a ridge-like regres¬ 
sion problem. Much of the computations involved, however, can be pre-computed, 
specifically the matrix inversion, (X^X/n-1-1)^*, and matrix-vector multiplica¬ 
tion, X^y. In cases where p ^ n, inverting a p x p matrix is is highly com¬ 
putationally intensive, requiring ^{p^) operations. We can reduce the computa¬ 
tional cost to however, by invoking the Woodbury Matrix Identity ifTSll : 

(X^X/n-|-Ip) ^ = Ip — X^(«I„-1-XX^)^*X and caching the Cholesky decom¬ 
position of the smaller n-hy-n matrix nI„-fXX^. Thus, the iterative updates for 

are reduced to the cost of solving two n-hy-n triangular linear systems 

of equations. The z-subproblem is solved via soft-thresholdingwhich requires only 
G’{p) operations. 

We study our Algorithmic Regularization Path for sparse linear regression through 
a real data example. We use the publicly available 14-cancer microarray data from 
lITSl to form our covariate matrix. This consists of gene expression measurements 
for n = 198 subjects and 16063 genes; we randomly sample p = 2000 genes to use 
as our data matrix X. We simulate sparse true signal J3* with s = 16 non-zero fea¬ 
tures of absolute magnitude 5-10, and with the signs of the non-zero signals assigned 
randomly; the 16 non-zero variables were randomly chosen from the 2000 genes. 
The response variable y is generated as y -l-e, where e N{0,1). A visual¬ 

ization of regularization paths, stability paths, and our Algorithmic Regularization 
Paths is given in Figure |^for this example. 

First, we verify empirically that that our ADMM algorithm with warm starts 
is equivalent to the regularization path (top left and top middle). Additionally, no¬ 
tice that, as expected, our Algorithm Path with a tiny step size (bottom right) also 
well approximates the sequence of active sets given by the regularization paths. 
With a larger step-size, however, our algorithm path (bottom left) yields a sequence 
of sparse models that differ markedly from the sparsity patterns of the regulariza¬ 
tion paths. This occurs as the change in regularization levels of each step are large 
enough so that the sparsity levels of the z-subproblem after the one-step approxima¬ 
tion are not equivalent to that of the solution to (|^. 

Despite this. Figure suggests that our Algorithm Paths with larger step sizes 
may have some advantages in terms of variable selection. Notice that regulariza¬ 
tion paths select many false positives (blue and gray dashed lines) before the true 
positives (red lines). This is expected as we used a real microarray data set for X 
consisting of strongly correlated variables that directly violate the irrepresentable 
conditions under which variable selection for sparse regression is achievable 11. 
Our method, however, selects several true variables before the hrst false positive 
enters the model. To understand this further, we compare our approach to the Sta¬ 
bility Paths used for stability selection l23l . a re-sampling scheme with some of the 
strongest theoretical guarantees for variable selection. The stability paths, however, 
also select several false positives. This as well as other empirical results that are 
omitted for space reasons suggest that our Algorithm Path with moderate or larger 
step sizes may perform better than convex optimization techniques in terms of vari- 
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Algorithm Path Algorithm Path with Tiny Step Size 




Fig. 2 Comparisons of Algorithmic Regularization Paths (bottom panel) to regularization paths 
(top left and middle) and stability paths (top right) for the sparse linear regression example. The 

lines denote false variables, — lines denote true non-zero variables, and-lines denote 

some highlighted false positives. Regularization paths were computed via the popular shooting 
method cni (top left) and our ADMM algorithm with warm-starts (top middle). Our Algorithmic 
Regularization Path with a tiny step size (bottom right) closely approximates the sparsity patterns 
of the regularization paths, while our method with a larger step size (bottom left) dramatically 
differs from the regularization paths. Notice that sparse regression in this example does a poor job 
of variable selection, selecting many false positives before any true features enter the model. Even 
the stability paths (top right) select many false positives. Our Algorithmic Regularization Path with 
a larger step-size, however, selects many of the true variables with much fewer false positives. 


able selection. While a theoretical investigation of this is beyond the scope of this 
book chapter, the intuition for this is readily apparent. Our Algorithm Path starts 
from a dense solution and uses a ridge-like penalty. Thus, coefficients of highly cor¬ 
related variables are likely to be grouped and have similar magnitude coefficient 
values. When soft-thresholding is performed in the z-subproblem, variables which 
are strongly correlated are likely to remain together in the model for at least the first 
several algorithm iterations. By keeping correlated variables together in the model 
longer and otherwise eliminating irrelevant variables, this gives our algorithm a bet¬ 
ter chance of selecting the truly non-zero variables out of a correlated set. Hence, 
the fact that we start with a dense solution seems to help us; this is in contrast to 
the LASSO, LAR and OMP paths which are initialized with an empty active set 
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and greedily add variables most correlated with the response [|28]|8l. We plan on 
investigating our methods in terms of variable selection in future work. 


Table 1 Timing comparison (averaged over 50 replications) of our ADMM Algorithmic Regu¬ 
larization Paths, Regularization Paths obtained from the shooting method (coordinate descent) & 
Stability Paths for different numbers of variables in the true model. 


Time (seconds) 

Algorithmic Regularization Path 

Regularization Path 

Stability Path 

s = 20, p = 4000 

0.0481 

0.1322 

36.6813 

s = 20, p = 6000 

0.0469 

0.1621 

43.9320 


Finally, we compare our Algorithm Paths to state-of-the-art methods for com¬ 
puting the sparse regression regularization paths in terms of computational time in 
Table[T] The regularization paths were computed using the glmnet R package ifTOl 
which is based on shooting (coordinate descent) routines ifTOll . This approach and 
software is widely regarded as one of the fastest solvers for sparse regression. No¬ 
tice that our Algorithm Paths, coded entirely in Matlab, run in about a hfth of the 
time as this state-of-the-art competitor. Also, our computational time is far superior 
to the re-sampling schemes required to compute the stability paths. 

Overall, our Algorithmic Regularization Path for sparse linear regression reveals 
major computational advantages for hnding a sequence of sparse models that ap¬ 
proximate the active sets of regularization paths. Additionally, empirical evidence 
suggests that our methods may also enjoy some important statistical advantages in 
terms of variable selection that we will explore in future work. 


3.2 Reduced-Rank Multi-Task Learning 

Our ADMM Algorithmic Regularization Path applies generally to many convex 
penalty types beyond the £i-norm. Here, we demonstrate our method in conjunction 
with a reduced-rank multi-task learning problem also called multi-response regres¬ 
sion. This problem has been studied by 1221 among many others. 

Suppose we observe n iid samples measured on p covariates and for q out¬ 
comes, yielding a covariate matrix, X € and a response matrix Y € 

Then, our goal is to ht the following statistical model: Y = XB+e, where B is the 
p X q coefficient matrix which we seek to learn, and e is independent noise. As 
often the number of covariates is large relative to the sample size, pq ^ n, many 
have suggested to regularize the coefficient matrix B by assuming it has a low- 
rank structure, rank(B) < pAq. Thus, our model space of sparse solutions is given 
by the space of all possible reduced-rank solutions. Exploring this space is an NP 
hard computational problem; thus, many have employed the nuclear norm penalty, 
||B||* = Ly=f<T/(B), which is the sum (or £i-norm) of the singular values of B, 
(j(B), and the tightest convex relaxation of the rank constraint. Thus, we arrive at 
the following optimization problem: 
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minimize ^||Y —XB||p + A||B||* (4) 

Here, || • ||f is the Frobenious norm, A > 0 is the regularization parameter controlling 
the rank of the solution and || -H* is the nuclear norm penalty. 

For model selection then, one seeks to explore the sequence of low-rank solu¬ 
tions obtained as A varies. To develop our Algorithm Path for approximating this 
sequence of low-rank solutions, let us consider the ADMM sub-problems for solv¬ 
ing 0. The augmented Lagrangian, sub-problems, dual updates are analogous to 
that of the sparse linear regression example, and hence we omit these here. Examin¬ 
ing the Z-subproblem, however, recall that this is the proximal operator for the nu¬ 
clear norm penalty: T} = argmin ^ ||Z —(B^ -|-U^)||p -f 7 ||Z||*, which can be solved 

z 

by soft-thresholding the singular values: Suppose that A = is the SVD of A. 

Then singular-value thresholding is defined as SVTy(A) = U[diag((<7 — 7)+)]V^ 
and the solution for the Z sub-problem is Z^ = SVTy(B^ -fU^). 


Algorithm 4 Algorithmic Regularization Path for Reduced-Rank Regression 

1. Initialize: Z** = 0, = 0, 7 ° = e, and step size ? > 0. 

2. Precompute: H= (X^X/n -I- 1)-' andAX^Y. 

3. While ||Z*|| 7^0 

yk — +r (or / = /“‘f). 

= HXTy + H(Z*-' -U*-'). 

Z*^ = SVTyi ). (Record Z* at each iteration.) 

U* = U*^'-|-B*^-U*^ 

end 

4. Output {Z* : I: = 1, • • • , as the algorithmic regularization path. 


Then, following the framework of the sparse linear regression example, our 
ADMM Algorithmic Regularization Path for the reduced-rank mutli-task learning 
(regression) is outlined in Algorithm]^ Notice that the algorithm has the same basic 
steps as in the previous example except that solving the proximal operator for the Z 
sub-problem entails singular value thresholding. This step is the most computation¬ 
ally time consuming aspect of the algorithm as K total SVDs must be computed to 
approximate the sequence of solutions. Also note that similarly to the sparse regres¬ 
sion example, the inversion needed, (X^X/n-|-I)^*,canbe precomputed by using 
the matrix inversion identities as previously discussed and cached as a convenient 
factorization; hence, this is computationally feasible even when n. 

To demonstrate the computational advantages of our approach, we conduct a 
small simulation study comparing our method to the two most commonly used al¬ 
gorithms for reduced-rank regression: proximal gradient descent and ADMM. First, 
we generate data according to the model: Y = XB-l-e, where X 200 XI 00 is generated 
as independent standard Gaussians, Biooxioo is an image of the Batman symbol, 
and e 200 x 100 is independent standard Gaussian noise. We set the signal in the coef¬ 
ficient matrix to be a low-rank image of the Batman symbol, rank(B) = 38, which 
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Estimated 2, Rank - 38 




Fig. 3 Reduced Rank Regression simulated example. The true coefficient matrix, B 6 g^iooxioo^ 
is an image of batman that is rank 38. Our Algorithm Path provides a sequence of low-ranks 
solutions at a fine resolution (top left) that well-approximate the low-rank signal; three such low- 
rank solutions (top right and bottom panel) are shown from iterates of our Algorithm Path. 



# Ranks Considered 

#SVDs 

Time in Seconds 

Algorithm Path 

90 

476 

2.354 

Proximal Gradient 

57 

2519 

12.424 

ADMM 

51 

115,946 

599.144 


Table 2 Algorithm comparisons for reduced rank regression example. 


can be well-approximated by further reduced rank images. We applied our Algo¬ 
rithmic Regularization Paths to this simulated example with 500 logarithmically- 
spaced values of 7 . Results are given in Figure]^ and show that our Algorithm Path 
smoothly explores the model space of reduced rank solutions and nicely approxi¬ 
mates the true signal as a low-rank batman image. We also conduct a timing com¬ 
parison to implementations of proximal gradient descent and ADMM algorithms 
using warm-starts for this same example; results are given in Table Here, we see 
that our approach requires much fewer SVD computations and is much faster than 
both algorithms, especially the ADMM algorithm. Additionally, both the ADMM 
and proximal gradient algorithm employed 100 logarithmically spaced values of the 
regularization parameter, X. With this, however, we see that not all possible ranks 
of the model space are considered, with proximal gradient and ADMM considering 
57 and 51 ranks out of 100 respectively. In contrast, our Algorithmic Regularization 
Path yields a sequence of sparse solutions at a much finer resolution, considering 90 
out of the 100 possible ranks. Thus, for proximal gradient and ADMM algorithms 
to consider the same range of possible sparsity levels (ranks), a greater number of 
problems would have to be solved over a much finer grid of regularization parame¬ 
ters, further inflating compute times. 
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Overall, our approach yields substantial computational savings for computing a 
sequence of sparse solutions for reduced rank regression compared to other state- 
of-the-art methods for this problem. 


3.3 Convex Clustering 

Our final example applies the ADMM Algorithmic Regularization path to an ex¬ 
ample with fusion type or non-separable penalties, namely a recently introduced 
convex formulation of cluster analysis Einiiii. Given n points yj,..., y„ in 
we pose the clustering problem as follows. Assign to each point y, its own clus¬ 
ter center S We then seek an assignment of that minimizes the distances 
between y, and and seeks sparsity between cluster center pairs jS,- and /3y. Com¬ 
puting all possible cluster assignments, however, is an NP hard problem. Hence, the 
following relaxation poses hnding the cluster assignments as a convex optimization 
problem: 


minimize -112, (5) 

where A is a positive regularization parameter, and Wij is a nonnegative weight. 
When A = 0, the minimum is attained when = y,-, and each point occupies a 
unique cluster. As X increases, the cluster centers begin to coalesce. Two points y,- 
and yj with jS,- = are said to belong to the same cluster. For sufficiently large X 
all points coalesce into a single cluster at y, the mean of the y,-. Because the objective 
in 0 is strictly convex and coercive, it possesses a unique minimizer for each value 
of A. This is in stark contrast to other typical criteria used for clustering, which often 
rely on greedy algorithms that are prone to get trapped in suboptimal local minima. 
Because of its coalescent behavior, the resulting solution path can be considered a 
convex relaxation of hierarchical clustering HU. 

This problem generalizes the fused LASSO 041 . and as with other fused LASSO 
problems, penalizing affine transformations of the decision variable makes mini¬ 
mization challenging in general. The one exception is when a 1-norm is used in¬ 
stead of the 2-norm in the fusion penalty terms. In this case, the problem reduces to 
a weighted one-dimensional total variation denoising problem. Under other norms, 
including the 2-norm, the situation, is salvageable if we adopt a splitting strategy 
discussed earlier in Section p~T| for dealing with fusion type or non-separable penal¬ 
ties. Briefly, we consider using the 2-norm in the fusion penalty to be most broadly 
applicable since the solutions to the convex clustering problem become invariant to 
rotations in the data. Consequently, clustering assignments will also be guaranteed 
to be rotationally invariant. 

Let the variables zij e ‘W record the differences between the /th and y'th points. 
We denote the collections of variables ILi {Zij}i<j by P and z respectively. 
Then the original problem can be reformulated as: 
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1 " 

minimize ||y,- -^,|l 2 lb subject to j3,-- j3 ■ -z,^ = 0. (6) 

^,=1 i<i 

Consider the ADMM algorithm derived in Q for solving (|^. Let u,j S 91^ denote 
the Lagrange multiplier for the i yth equality constraint. Let u denote the collection 
of variables {u,y },<y . The augmented Lagrangian is given by: 

^ 1=1 i<i ^ i<j 

Then, the three ADMM subproblems are given by: 

= argmin i£||y,.-^;||^ + i^||P,.-P^.-z,;+Uy||^ 

^ ^ i=i ^ i<j 

z^+' = argminA^w;;||z;j||2 + ^^||^i-^ —z,j+u,7||^ 

* i<j ^ i<j 

u^'=uJj + [z^'-(^f+'-j3f')]. 

Splitting the variables in this manner allows us to solve a series of straightforward 
subproblems. Updating /3 involves solving a ridge regression problem. Despite the 
fact that the quadratic penalty term is not separable in the P, after some algebraic 
maneuvering, which is detailed in El, it is possible to explicitly write down the 
updates for each j3 separately: 




1 


1 +n 1 +n 


1 


1 +n 


E[A+Ai-Ei“‘/+4i 


j>‘ j« 

Updating z requires minimizing an objective that separates in each of the z,j. 


a+i 


1 , 


z;y ‘ = argmin -||z;j- -ufj||^+Aw,;||Zy|| 2 . 

7;; ^ 


This step can be computed explicitly using the block-wise soft-thresholding opera¬ 
tor, the proximal operator of the group LASSO JHl, namely. 


5(z,t) = argmin :^|i C - z|l2 +'^^11 CIb = 





where a+ = max(a,0) and T > 0 controls the amount of shrinkage towards zero. 

For model selection purposes, one typically studies the sequence of cluster as¬ 
signments given by coalescent patterns of j3, or the sparse patterns in the first dif¬ 
ferences of j3, as A varies. We then seek to quickly approximate this sequence of 
active sets given by the coalescent patterns of J3 with our Algorithmic Regulariza¬ 
tion Paths, summarized in Algorithm]^ 
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Algorithm 5 Algorithmic Regularization Path for Convex Clustering 

1. Initialize z?. = 0, u° = 0, =e,k= 1, and set f > 0. 

2. While ||z'^||f>0: 

for all i do 

= [Ti;y<- + ifiiy] + rb -^Ki[4+4l] 

end for 
for all i < j do 

u^‘=u^ + [z^‘-(Pb'-Pb')], 

end for 

y(*+l) = ty(k) 

3. Output \ zj^ > as the algorithm path . 


As in the general case, we can use iterates of the z-subproblem to approximate 
a sparse sequence of cluster assignments. Given z, we can determine a clustering 
assignment in time that is linear in the number of data points n. We simply apply 
breadth-first search to identify the connected components of the following graph 
induced by the z. The graph identifies a node with every data point and places an 
edge between the /th and jth node if and only if z,y = 0. Each connected component 
corresponds to a cluster. 

We now illustrate on a simulated “halfmoon” data set of n = 200 points in 
that computing our Algorithm Path can lead to non-trivial computational cost sav¬ 
ings for obtaining a sequence of clustering assingments. We first detail some pre¬ 
liminaries. Although we do not take the space to discuss it here, in practice the 
choice of weights is very important. This topic is explored in 0 , and we use the 
sparse kernel weights which were shown to work well empirically in that paper. 
We created a geometric sequence of parameters and namely given a fixed 
multiplicative factor t > 1, we set A ^ = f A . The sequence 7 *^^) was constructed 
similarly, although we study our Algorithm Paths for several multiplicative factors, 
t e {1.1,1.05,1.01}. 

In contrast to the regularization path, the Algorithm Path does not require any 
convergence checks since only one step is taken at each grid point. Nonetheless, we 
only report the number of rounds of ADMM updates taken by each approach. The 
Algorithm Path took 259,1294, and 2,536 rounds of updates for the three step sizes 
considered; in contrast, the regularization path even for a very modest tolerance 
level, lO^'^, required a grand total of 30,008 rounds of updates, substantially more 
than our approach. 

Figure|^shows the ADMM Algorithmic Regularization paths and regularization 
path respectively for this simulated example. For each data point i we plot the se¬ 
quence of the segments between consecutive estimates of its center, namely 
and jSf. These paths begin to overlap and merge into “trunks” when center esti¬ 
mates for close-by data points begin to coincide as the parameters A^*^) and 
becomes sufficiently large. For sufficiently small step sizes for the regularization 
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levels the Algorithm Path and regularization path are strikingly similar, as expected 
and demonstrated previously in our other examples. For larger step sizes, however, 
the paths differ markedly, but still appear to capture the same clustering assign¬ 
ments. Overall, although the simulated data is relatively small, computing the whole 
regularization path, even for a modest stopping tolerance can, requires an order of 
magnitude more iterations and computational time than the Algorithm Path. 



Fig. 4 Convex clustering on simulated data: In the first three panels (from left to right, top to 
bottom), lines trace the ADMM Algorithmic Regularization path of the individual cluster centers 
as the algorithm path parameter / increases fort =1.1 (large), 1.05 (medium), and 1.01 (small). In 
the panel in the lower right corner, the lines trace the regularization path of the individual cluster 
centers as the regularization parameter X increases. 


4 Discussion 

In this chapter, we have presented a novel framework for approximating the se¬ 
quence of active sets associated with regularization paths of sparse statistical learn¬ 
ing problems. Instead of solving optimization problems over a grid of penalty pa¬ 
rameters as in traditional regularization paths, our algorithm performs a series of 
one-step approximations to an ADMM algorithm employing warm-starts with the 
goal of estimating a good sequence of sparse models. Our approach has a number of 
advantages including easy implementation, exploration of the sparse model space at 
a fine resolution, and most importantly fast compute times; we have demonstrated 
these advantages through several sparse statistical learning examples. 
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In our demonstrations, we have focused simply on computing the full sequence 
of active sets corresponding to the regularization path which is the critical compu¬ 
tationally intensive step in the process of model selection. Once the sequence of 
sparse models has been found, common methods for model selection such as AIC, 
BIC, cross-validation and stability selection, can be employed to choose the optimal 
model. We note that with regularization paths, model selection procedures typically 
choose the optimal A which indexes the optimal sparse model. For our Algorithm 
Paths which do not directly solve regularized statistical problems, model selection 
procedures should be used to choose the optimal iteration, k, and the corresponding 
sparse model given by the active set of z^. While this chapter has focused on finding 
the sequence of sparse models via our Algorithm Paths, we plan to study using these 
paths in conjunction with common model selection procedures in future work. 

As the ADMM algorithm has been widely used for sparse statistical learn¬ 
ing problems, the mechanics are in place for broad application of our Algorithm 
Paths which utilize the three standard ADMM subproblems. Indeed, our approach 
could potentially yield substantial computational savings for any ADMM applica¬ 
tion where the /3 and z can be solved efficiently. Furthermore, there has been much 
recent interest in distributed versions of ADMM algorithms ll^l26l . Thus, there is 
the potential to use these in conjunction with our problem to distribute computation 
in the j3 and z subproblems and further speed computations for Big-Data prob¬ 
lems. Also, we have focused on developing our Algorithm Path for sparse statistical 
learning problems that can be written as a composite of a smooth loss function 
and a non-smooth, convex penalty. Our methods, however, can be easily extended 
to study constrained statistical learning problems, such as that of the support vec¬ 
tor machines. Finally, our framework utilizes the ADMM splitting method, but the 
strategies we develop could also be useful for computing a sequence of sparse mod¬ 
els using other operator splitting algorithms. 

Our work raises many questions from statistical and optimization perspectives. 
Further work needs to be done to characterize and study the mathematical properties 
of the Algorithm Paths as well as relate them to existing optimization procedures and 
algorithms. For example, ADMM is just one of many variants of proximal methods 
II 29 I . We suspect that other variants, such as proximal gradient descent, used to fit 
sparse models will also benefit from an Algorithm Path approach in expediting the 
model selection procedure. We leave this as future work. 

In our demonstrations in Section]^ we suggested empirically that our Algorithm 
Paths with a tiny step size closely approximate the sequence of active sets associated 
with regularization paths. Further work needs to be done to verify this connection 
mathematically. Along these lines, a key practical question is how to choose the 
appropriate step size for increasing the amount of regularization as the algorithm 
progresses. As we have demonstrated, changing the step size yields paths with very 
different solutions and behaviors that warrant further investigation. For now, our 
recommendation is to employ a fairly small step size as these well-approximate the 
traditional regularization paths in all of the examples we have studied. Additionally, 
our approach may be related to other new proposals for computing regularization 
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paths based on partial differential equations, for example ll32l : these potential con¬ 
nections merit further investigation. 

Our work also raises a host of interesting statistical questions as well. The sparse 
regression example suggested that Algorithm Paths may not simply yield computa¬ 
tional savings, but may also perform better in terms of variable selection. This raises 
an interesting statistical prospect that we plan to carefully study in future work. 

To conclude, we have introduced a novel approach to approximating the sequence 
of active sets associated with regularization paths for large-scale sparse statistical 
learning procedures. Our methods yield substantial computational savings and raise 
a number of interesting open questions for future research. 
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